The analysis of the properties in London

Programming Tools for Urban Analytics, University of Glasgow, 25.03.2023

1. Introduction

The London property market is one of the most dynamic and competitive in the world. The city is home to a diverse population of over 8.9 million people, making it the largest urban area in the United Kingdom, and it is a major global financial centre hosting the headquarters of many multinational corporations. As a result, London property is highly sought after by both local and international investors, and the city's housing market is among the most expensive in the world, with the average house price reaching £1.2 million in 2021.

In this analysis, we explore the current state of the London property market using data scraped from Rightmove, a leading UK property website. The dataset contains information on 24,899 properties in London, including their location, price, property type, and other relevant details; it was collected on 19th March 2023 and is available for download here. Our aim is to provide insight into the trends, prices, and other factors that shape the London real estate landscape, so that potential homebuyers, investors, and other stakeholders can make well-informed decisions. We examine average house prices, property types, geographical distribution, and market trends to build a comprehensive picture of the dynamics shaping London's housing market.

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import folium
from wordcloud import WordCloud
import matplotlib.colors as mcolors
import branca
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import geopandas as gpd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

2. Data Preprocessing

The data we scraped from Rightmove contains information on 24,899 properties in London, including their location, price, property type, and other relevant details. First, we load the data into a Pandas DataFrame and perform some basic cleaning and preprocessing steps; we then explore the data to gain a better understanding of the London property market.

2.1.1 Load the data

In [ ]:
data = pd.read_csv('rightmove.csv')
print(len(data))                     # Number of listings
data.info()                          # Column dtypes and non-null counts
print(data.head())                   # First five rows
print(data['price'].describe())      # Summary of the (string-typed) price column
24899
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24899 entries, 0 to 24898
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          24899 non-null  object 
 1   property_type  24899 non-null  object 
 2   price          24899 non-null  object 
 3   house_area     6884 non-null   object 
 4   address        24899 non-null  object 
 5   Feature        24899 non-null  object 
 6   latitude       24655 non-null  float64
 7   longitude      24655 non-null  float64
dtypes: float64(2), object(6)
memory usage: 1.5+ MB
                 title property_type       price house_area  \
0  1 bedroom apartment     Apartment     595,000        NaN   
1       4 bedroom flat          Flat  64,999,999        NaN   
2  5 bedroom apartment     Apartment  60,000,000        NaN   
3      7 bedroom house         House  45,000,000      1,247   
4     12 bedroom house         House  45,000,000      1,360   

                                       address  \
0                    Kilburn Lane, London, W10   
1                  Knightsbridge, London, SW1X   
2                 One Hyde Park, Knightsbridge   
3                 Lygon Place, Belgravia, SW1W   
4  Mayfair Freehold House, Park Lane Area, W1K   

                                             Feature   latitude  longitude  
0  ['Private terrace', 'Communcal terrace', 'Bout...  51.532046  -0.210721  
1  ['Super Prime', 'London', 'Knightsbridge and M...  51.497825  -0.154109  
2  ['Apartment', '5 Beds', '2 Receptions', '5 Bat...  51.501984  -0.160544  
3  ['Cinema Room', 'Entertainment Room', 'CCTV', ...  51.496589  -0.146982  
4  ['Mayfair mansion with 26 meter frontage', 'Lo...  51.508009  -0.154332  
count         24899
unique          475
top       6,000,000
freq           1221
Name: price, dtype: object

2.1.2 Property types in London

In [17]:
data = data.dropna()  # Drop rows with any missing value (note: this also removes listings without house_area or coordinates)
property_counts = data['property_type'].value_counts()

# Plot the bar chart
plt.figure(figsize=(20, 6))
sns.barplot(x=property_counts.index, y=property_counts.values)
plt.xlabel('Property Type')
plt.ylabel('Count')
plt.title('Number of Properties by Type')
plt.show()

Based on the analysis conducted, the dataset includes information on 24,899 properties in London. The most prevalent property type is apartments and flats, while the least common is studios. London's high population density, limited space, and urban characteristics contribute to the dominance of multi-unit residential buildings (Whitehead & Williams, 2011). These properties cater to the housing needs of various demographics, such as families, young professionals, and students, and can be found across the city's diverse neighborhoods (Cheshire, Hilber, & Koster, 2018).

Studios account for a mere 1% of all properties in London. These compact living spaces typically serve a niche market, including young professionals, students, and single occupants seeking affordable and low-maintenance housing options (Scanlon, Fernández Arrigoitia, & Whitehead, 2015). The scarcity of studios might result from developers prioritizing larger, more lucrative residential units due to market demand (Hamnett, 2003) or specific urban planning policies that encourage the construction of other property types (Ronald & Kadi, 2018).
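The studio share quoted above can be computed directly from the value counts. A minimal sketch (the series below is a toy stand-in; in the notebook the same call would run on data['property_type']):

```python
import pandas as pd

# Toy stand-in for data['property_type'] from the scraped dataset
property_type = pd.Series(
    ['Apartment'] * 50 + ['Flat'] * 30 + ['House'] * 19 + ['Studio'] * 1
)

# Share of each property type as a percentage of all listings
shares = property_type.value_counts(normalize=True) * 100
print(shares.round(1))
```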

2.1.3 Property prices in London

In [18]:
data['price'] = pd.to_numeric(data['price'].astype(str).str.replace(',', ''), errors='coerce')
data['house_area'] = pd.to_numeric(data['house_area'].astype(str).str.replace(',', ''), errors='coerce')


# Remove rows with non-numeric values (such as 'POA') in the 'price' column
data.dropna(subset=['price'], inplace=True)
plt.figure(figsize=(12, 6))
sns.histplot(data['price'], kde=True, bins=50)
plt.xlabel('Price')
plt.ylabel('Density')
plt.title('Price Distribution with Kernel Density Estimation')
plt.show()
# Define the price range bins
bins = [0, 10000000, 20000000, 30000000, 40000000, float('inf')]
labels = ['100k-10M', '10M-20M', '20M-30M', '30M-40M', '40M+']
# Categorize properties into price range bins
data['price_range'] = pd.cut(data['price'], bins=bins, labels=labels)

# Count the number of properties in each bin
property_counts = data['price_range'].value_counts().sort_index()

# Plot the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=property_counts.index, y=property_counts.values)
plt.xlabel('Price Range')
plt.ylabel('Count')
plt.title('Number of Properties by Price Range')
plt.xticks(rotation=45)
plt.show()

From the above plot, we can see that the most common price range in the London property market is between £100,000 and £10,000,000. This indicates that a significant portion of the market consists of properties within this price range, catering to homebuyers and investors with various financial capabilities seeking diverse real estate investment opportunities.

On the other hand, the least common price range is for properties valued at more than £30,000,000. These properties are relatively rare, representing the high-end segment of the real estate market, often featuring exceptional amenities, prime locations, and high-quality finishes. These properties cater to a niche market of high-net-worth individuals seeking unique and exclusive real estate investments.

This reflects the breadth of the London real estate market, which encompasses properties at many price points to meet the needs of different homebuyers.
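The share of listings in each band can be quantified the same way the bar chart is built, with pd.cut followed by a normalised value count. A hedged sketch on a handful of made-up prices (in the notebook this would run on data['price']):

```python
import numpy as np
import pandas as pd

# Hypothetical asking prices in pounds, standing in for data['price']
prices = pd.Series([450_000, 2_000_000, 9_500_000, 15_000_000, 25_000_000, 41_000_000])

bins = [0, 10_000_000, 20_000_000, 30_000_000, 40_000_000, np.inf]
labels = ['100k-10M', '10M-20M', '20M-30M', '30M-40M', '40M+']

# Assign each price to a band, then convert counts to percentages
price_range = pd.cut(prices, bins=bins, labels=labels)
share = price_range.value_counts(normalize=True).sort_index() * 100
print(share.round(1))
```

Half the toy listings land in the lowest band, mirroring the dominance of that range in the real data.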

2.1.4 Property locations in London

In [19]:
from folium.plugins import HeatMap

mean_lat, mean_lon = data['latitude'].mean(), data['longitude'].mean()
std_lat, std_lon = data['latitude'].std(), data['longitude'].std()

# Set the threshold for removing outliers (e.g., 3 times the standard deviation)
threshold = 3

# Filter out points that are far away from the mean
filtered_data = data[((data['latitude'] - mean_lat).abs() <= threshold * std_lat) &
                     ((data['longitude'] - mean_lon).abs() <= threshold * std_lon)]

# Calculate the center of the map (use the mean latitude and longitude)
center_lat, center_lon = filtered_data['latitude'].mean(), filtered_data['longitude'].mean()

# Create a Folium map (avoid shadowing the built-in `map`)
property_map = folium.Map(location=[center_lat, center_lon], zoom_start=11)
data_heatmap = []
# Add circle markers to the map
for _, row in filtered_data.iterrows():
    folium.CircleMarker(location=[row['latitude'], row['longitude']],
                        color='blue',
                        fill=True,
                        fill_color='blue',
                        fill_opacity=0.8,
                        radius=1).add_to(property_map)
    data_heatmap.append([row['latitude'], row['longitude']])
# Overlay a density heatmap and display the map
HeatMap(data_heatmap).add_to(property_map)
property_map
Out[19]:
[Interactive Folium map: heatmap of property density across London; trust the notebook to render it.]

The heatmap shows that the highest concentration of properties is in central London, particularly around Hyde Park. This area is a prime location due to its proximity to the city centre and the amenities provided by the park itself: Hyde Park is one of London's most iconic parks, offering vast green spaces, recreational activities, and numerous events throughout the year, and the surrounding area is known for its upscale residential properties, luxury hotels, and high-end shopping. As a result, properties here are in high demand, leading to greater density than in other regions. This is expected, since central London is a hub for business, tourism, and residential life; as you move further from the city centre, property density tends to decrease.

In contrast, the northern part of London has a lower concentration of properties. This could be due to several factors, such as differences in land use, population density, and the availability of amenities and services. In many cities, including London, the centre is the hub of economic and social activity, attracting businesses, tourists, and residents alike, which drives higher demand and denser residential and commercial development. The north, being further from the centre, sees less commercial activity and fewer amenities, and may also contain more green spaces, parks, or protected areas that limit property development.

2.1.5 The distribution of property types in London

In [20]:
# Calculate the center of the map (use the mean latitude and longitude)
center_lat, center_lon = filtered_data['latitude'].mean(), filtered_data['longitude'].mean()

# Create a Folium map
property_type_map = folium.Map(location=[center_lat, center_lon], zoom_start=11)

# Define the color mapping based on property_type
unique_property_types = filtered_data['property_type'].unique()
n_unique_property_types = len(unique_property_types)

# Generate a color palette with the same number of colors as unique property types
color_palette = sns.color_palette('husl', n_unique_property_types)

# Convert the colors from RGB to hexadecimal
hex_colors = [mcolors.rgb2hex(color) for color in color_palette]

# Create a dictionary mapping property types to colors
property_type_colors = {property_type: color for property_type, color in zip(unique_property_types, hex_colors)}


# Add the property locations to the map with different colors based on property_type
for _, row in filtered_data.iterrows():
    folium.CircleMarker(location=[row['latitude'], row['longitude']],
                        radius=1, # Adjust the size of the circle by changing the radius value
                        color=property_type_colors[row['property_type']],
                        fill=True,
                        fill_color=property_type_colors[row['property_type']],
                        fill_opacity=1).add_to(property_type_map)
# Create an HTML legend
legend_html = '''
<div style="position: fixed;
            bottom:  250px; left: 50px; width: 150px; height: 150px;
            border:2px solid grey; z-index:9999; font-size:14px;">
    &nbsp;<b>Property Type</b><br>
'''

for property_type, color in property_type_colors.items():
    legend_html += f'<i style="background:{color}; width: 10px; height: 10px; display: inline-block; margin-left: 5px;"></i> {property_type}<br>'

legend_html += '</div>'

# Add the legend to the map
legend = branca.element.Element(legend_html)
property_type_map.get_root().html.add_child(legend)

# Display the map
property_type_map
Out[20]:
[Interactive Folium map: property locations coloured by property type, with legend; trust the notebook to render it.]

This map illustrates the geographical distribution of various property types across London. The heart of the city is characterized by a high concentration of apartments, terraced houses, and flats. Due to factors such as limited space, high land values, and proximity to economic and cultural centres, these housing options are popular in central areas, so developers often focus on high-density residential buildings to accommodate a larger population and maximize return on investment.

Detached houses, on the other hand, are primarily found in the suburbs. As opposed to the bustling city centre, these suburban neighbourhoods offer more available land and a quieter, more spacious living environment. A detached house caters to families or individuals who seek a more private and tranquil residential setting, often with gardens or backyards, and suburban areas typically have lower land values, making detached homes affordable to a wider range of buyers. The map thus shows contrasting property types in London: apartments, terraced houses, and flats concentrated in the city centre, and detached houses more prevalent in the suburbs. This reflects London's historical development, urban planning trends, and economic factors, as well as the diverse housing needs of its residents.

2.1.6 The distribution of price ranges in London

In [21]:
center_lat, center_lon = filtered_data['latitude'].mean(), filtered_data['longitude'].mean()

# Create a Folium map
price_range_map = folium.Map(location=[center_lat, center_lon], zoom_start=11)

# Define the color mapping based on price_range
unique_price_range = filtered_data['price_range'].unique()
n_unique_price_range = len(unique_price_range)

# Generate a color palette with the same number of colors as unique price ranges
color_palette = sns.color_palette('husl', n_unique_price_range)

# Convert the colors from RGB to hexadecimal
hex_colors = [mcolors.rgb2hex(color) for color in color_palette]

# Create a dictionary mapping price ranges to colors
price_range_colors = {price_range: color for price_range, color in zip(unique_price_range, hex_colors)}

# Add the property locations to the map with different colors based on price_range
for _, row in filtered_data.iterrows():
    folium.CircleMarker(location=[row['latitude'], row['longitude']],
                        radius=3, # Adjust the size of the circle by changing the radius value
                        color=price_range_colors[row['price_range']],
                        fill=True,
                        fill_color=price_range_colors[row['price_range']],
                        fill_opacity=1).add_to(price_range_map)
# Create an HTML legend
legend_html = '''
<div style="position: fixed;
            bottom: 50px; left: 50px; width: 150px; height: 150px;
            border:2px solid grey; z-index:9999; font-size:14px;">
    &nbsp;<b>Price Range</b><br>
'''

for price_range, color in price_range_colors.items():
    legend_html += f'<i style="background:{color}; width: 10px; height: 10px; display: inline-block; margin-left: 5px;"></i> {price_range}<br>'

legend_html += '</div>'

# Add the legend to the map
legend = branca.element.Element(legend_html)
price_range_map.get_root().html.add_child(legend)

# Display the map
price_range_map
Out[21]:
[Interactive Folium map: property locations coloured by price range, with legend; trust the notebook to render it.]

The map above illustrates the geographical distribution of price ranges for properties in London. The majority of properties are priced between £100,000 and £10,000,000 and are located throughout the city in a wide variety of neighbourhoods; this is the most common price range and reflects the diverse housing needs of London's residents. By contrast, properties priced between £20,000,000 and £30,000,000 are relatively rare and tend to be concentrated in select neighbourhoods such as Mayfair, Belgravia, and Knightsbridge, areas traditionally known for containing some of the most prestigious and luxurious properties in the city. Properties exceeding £30,000,000 are rarer still and cluster around Hyde Park, pointing to strong demand for high-end luxury homes in central locations with easy access to amenities and cultural centres.

The map therefore provides valuable insight into the distribution of property prices in London and how this varies across neighbourhoods. In addition to showing the range of prices available, it illustrates the diversity of the London housing market, with properties suited to the needs and preferences of a wide range of buyers.

2.1.7 Word cloud analysis of property features

In [22]:
filtered_data = filtered_data[filtered_data['Feature'] != '[]']
filtered_data['Feature'] = filtered_data['Feature'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s,]+', '', x))
filtered_data['features_list'] = filtered_data['Feature'].apply(lambda x: x.split(', '))
filtered_data_high_price = filtered_data[filtered_data['price'] > 5000000]
all_features_text = " ".join(feature for feature_list in filtered_data_high_price['features_list'] for feature in feature_list)
wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10).generate(all_features_text)
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
In [23]:
filtered_data_low_price = filtered_data[filtered_data['price'] < 5000000]
all_features_text = " ".join(feature for feature_list in filtered_data_low_price['features_list'] for feature in feature_list)
wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10).generate(all_features_text)
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

The word clouds indicate that the characteristics commonly associated with properties on either side of the £5,000,000 threshold differ significantly. For properties priced above £5,000,000, the features tend to be luxurious and cater to a higher standard of living: the word cloud highlights amenities such as pools, parking (especially underground parking), terraces, balconies, cinema rooms, 24-hour concierge services, and gyms, suggesting that properties in this range are exclusive and offer an upscale lifestyle. Properties priced below £5,000,000, on the other hand, are more likely to emphasise basic necessities and practical features, such as kitchens, proximity to transport (near a station), bedrooms, and bathrooms, reflecting the essential elements of a property catering to more modest lifestyles and budgets. In summary, properties above £5,000,000 tend to advertise luxurious amenities, while those below focus on practical essentials.
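Word clouds give only a qualitative picture; the same comparison can be made quantitative by counting feature strings with collections.Counter. A sketch on hypothetical feature lists (in the notebook, filtered_data_high_price['features_list'] and its low-price counterpart would be used instead):

```python
from collections import Counter

# Hypothetical feature lists standing in for the scraped 'features_list' column
high_price_features = [['Pool', 'Gym', '24hr Concierge'], ['Pool', 'Cinema Room']]
low_price_features = [['Kitchen', 'Near Station'], ['Kitchen', 'Two Bedrooms']]

def top_features(feature_lists, n=3):
    """Count feature strings across all listings and return the n most common."""
    counts = Counter(f for features in feature_lists for f in features)
    return counts.most_common(n)

print(top_features(high_price_features))
print(top_features(low_price_features))
```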

2.1.8 K-means analysis of price per square foot

In [26]:
filtered_data['house_area'] = pd.to_numeric(filtered_data['house_area'], errors='coerce')
filtered_data['price_per_sqft'] = filtered_data['price'] / filtered_data['house_area']
print(filtered_data['price_per_sqft'].describe())
count     5055.000000
mean     30757.709101
std      12280.881988
min         89.160272
25%      22577.519380
50%      30252.100840
75%      38390.313390
max      79881.656805
Name: price_per_sqft, dtype: float64

To analyse the price per square foot in more depth, we cluster the data with the k-means algorithm. K-means is an unsupervised machine-learning algorithm that partitions data into groups: each data point is assigned to the cluster with the nearest centroid, and each centroid is then recomputed as the mean of its assigned points; the two steps are repeated until the assignments stop changing.
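The assignment and update steps described above can be sketched in a few lines of NumPy for the one-dimensional case (toy data; the notebook itself uses scikit-learn's KMeans):

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy 1-D "price per square foot" values drawn around three centres
x = np.concatenate([rng.normal(c, 1.0, 50) for c in (10.0, 30.0, 50.0)])

k = 3
centroids = np.array([10.0, 30.0, 50.0])  # deliberately simple initialisation
for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([x[labels == j].mean() for j in range(k)])

print(np.round(centroids, 1))
```

Real implementations add smarter initialisation (e.g. k-means++) and handle empty clusters; the loop above only illustrates the two alternating steps.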

In [14]:
filtered_data = filtered_data.dropna(subset=['price_per_sqft'])
price_per_sqft_scaled = StandardScaler().fit_transform(filtered_data[['price_per_sqft']])

# Determine the range of possible cluster numbers
range_n_clusters = range(1, 11)

# Calculate the average distortion for each number of clusters
distortions = []
for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(price_per_sqft_scaled)
    distortions.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(8, 6))
plt.plot(range_n_clusters, distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.title('Elbow curve for price_per_sqft clustering')
plt.show()

Here we use the elbow method to find a suitable number of clusters. The elbow method is a heuristic for validating consistency within cluster analysis: the distortion (inertia) is plotted against the number of clusters, and the optimal number is where the curve bends like an elbow. In this case the distortion decreases as the number of clusters increases, but the decrease is no longer significant beyond three clusters, so we choose three as the optimal number.

In [15]:
filtered_data = filtered_data.dropna(subset=['price_per_sqft'])
scaler = StandardScaler()
price_per_sqft_scaled = scaler.fit_transform(filtered_data[['price_per_sqft']])
k = 3
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(price_per_sqft_scaled)

# Add the cluster labels to the DataFrame
filtered_data['cluster'] = kmeans.labels_
cluster_summary = filtered_data.groupby('cluster')['price_per_sqft'].mean()
print(cluster_summary)
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Silhouette Coefficient
silhouette = silhouette_score(price_per_sqft_scaled, filtered_data['cluster'])
print(f"Silhouette Coefficient: {silhouette}")

# Calinski-Harabasz Index (Variance Ratio Criterion)
calinski_harabasz = calinski_harabasz_score(price_per_sqft_scaled, filtered_data['cluster'])
print(f"Calinski-Harabasz Index: {calinski_harabasz}")

cluster_range = range(2, 11)  # Candidate numbers of clusters, from 2 to 10
silhouette_scores = []

for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(price_per_sqft_scaled)
    cluster_labels = kmeans.labels_
    silhouette = silhouette_score(price_per_sqft_scaled, cluster_labels)
    silhouette_scores.append(silhouette)

# Plot the silhouette coefficient for each number of clusters
plt.figure(figsize=(10, 6))
plt.plot(cluster_range, silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficient for Different Cluster Numbers')
plt.grid()
plt.show()
cluster
0    32825.070562
1    17336.221400
2    47547.863689
Name: price_per_sqft, dtype: float64
Silhouette Coefficient: 0.5588607913977425
Calinski-Harabasz Index: 11332.929212408557

The silhouette coefficient measures how similar an object is to its own cluster (cohesion) compared with other clusters (separation). It ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. The silhouette coefficient for the price_per_sqft clustering is about 0.56, which suggests the data is reasonably well clustered. The Calinski-Harabasz index is another measure of clustering quality; it ranges from 0 to infinity, with higher values (relative to other clusterings of the same data) indicating better-separated clusters. Here the index is about 11,333, which is consistent with well-separated clusters.
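A toy example supports this reading of the two scores: on clearly separated data, the silhouette coefficient approaches 1 and the Calinski-Harabasz index becomes very large (synthetic data, not the Rightmove listings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Two tight, well-separated 1-D clusters
x = np.concatenate([rng.normal(0, 0.1, 50), rng.normal(10, 0.1, 50)]).reshape(-1, 1)

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(x)
sil = silhouette_score(x, labels)
ch = calinski_harabasz_score(x, labels)
print(round(sil, 2), round(ch))
```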

In [16]:
import matplotlib
k=3
# Calculate the center of the map (use the mean latitude and longitude)
center_lat, center_lon = filtered_data['latitude'].mean(), filtered_data['longitude'].mean()

# Create a Folium map
cluster_map = folium.Map(location=[center_lat, center_lon], zoom_start=11)

# Generate a color palette with the same number of colors as clusters
color_palette = sns.color_palette('viridis', k)

# Convert colors to hex format
hex_colors = [matplotlib.colors.to_hex(color) for color in color_palette]

# Create a dictionary mapping cluster labels to colors
cluster_colors = {cluster_label: color for cluster_label, color in zip(range(k), hex_colors)}

# Add circle markers to the map with different colors based on the cluster label
for _, row in filtered_data.iterrows():
    folium.CircleMarker(location=[row['latitude'], row['longitude']],
                        color=cluster_colors[row['cluster']],
                        fill=True,
                        fill_color=cluster_colors[row['cluster']],
                        fill_opacity=0.8,
                        radius=1).add_to(cluster_map)

legend_html = '''
<div style="position: fixed;
            bottom: 50px; left: 50px; width: 150px; height: 150px;
            border:2px solid grey; z-index:9999; font-size:14px;">
    &nbsp;<b>cluster result</b><br>
'''

for cluster_label, color in cluster_colors.items():
    legend_html += f'<i style="background:{color}; width: 10px; height: 10px; display: inline-block; margin-left: 5px;"></i> {cluster_label}<br>'

legend_html += '</div>'

# Add the legend to the map
legend = branca.element.Element(legend_html)
cluster_map.get_root().html.add_child(legend)
# Display the map
cluster_map
Out[16]:
[Interactive Folium map: properties coloured by k-means cluster, with legend; trust the notebook to render it.]

We can see that the data falls into three groups: the first cluster sits in the middle of the price-per-square-foot range, the second is the most expensive, and the third is the cheapest. The clusters are well separated, suggesting that price_per_sqft is a good feature for clustering.

In [17]:
import folium
from folium.plugins import HeatMap

# Calculate the center of the map (use the mean latitude and longitude)
center_lat, center_lon = filtered_data['latitude'].mean(), filtered_data['longitude'].mean()

# Create a Folium map
cluster_heatmap = folium.Map(location=[center_lat, center_lon], zoom_start=11)

# Loop through each cluster
for cluster_label in range(k):
    # Filter the data for the current cluster
    cluster_data = filtered_data[filtered_data['cluster'] == cluster_label]
    # Prepare data for the heatmap
    data_heatmap = []
    for _, row in cluster_data.iterrows():
        data_heatmap.append([row['latitude'], row['longitude']])

    # Add a HeatMap layer to the map for the current cluster
    HeatMap(data_heatmap, name=f"Cluster {cluster_label}").add_to(cluster_heatmap)

# Add layer control to the map
folium.LayerControl().add_to(cluster_heatmap)

# Display the map
cluster_heatmap
Out[17]:
[Interactive Folium map: per-cluster heatmaps with layer control; trust the notebook to render it.]

A heatmap is an effective way to visualise the distribution of the clustering results. Based on the heatmap, the data falls into three main groupings. The first is characterised by luxury properties and high real estate values in Mayfair and Knightsbridge. A second cluster of residential properties is concentrated in the Kensington and Marylebone areas, also renowned for upscale properties and excellent amenities. A third cluster is centred on Westminster, Chelsea, and Hampstead Garden Suburb, notable for high-quality properties in desirable neighbourhoods.

This visualisation suggests that price_per_sqft is an appropriate feature for clustering properties, because it correlates strongly with location: properties in certain neighbourhoods tend to have similar price-per-square-foot values, indicating that these areas share characteristics that contribute to their overall market value. Clustering on this feature therefore gives a better understanding of the dynamics of property prices in different neighbourhoods, enabling more informed decisions about real estate investment or property management.

2.1.9 Feature engineering for regression analysis

In [18]:
import googlemaps
gmaps = googlemaps.Client(key='YOUR_API_KEY')  # Replace with your own Google Maps API key
def get_nearest_school_distance(location):
    # Find the closest school via the Places API, then query the Distance
    # Matrix API for the travel distance to it in metres.
    try:
        result = gmaps.places_nearby(location=location, type='school', rank_by='distance')
        if len(result['results']) > 0:
            nearest_school_location = result['results'][0]['geometry']['location']
            distance = gmaps.distance_matrix(location, nearest_school_location, units='metric')['rows'][0]['elements'][0]['distance']['value']
            return distance
        else:
            return np.nan
    except Exception as e:
        print(e)
        return np.nan
filtered_data['nearest_school_distance'] = filtered_data.apply(lambda row: get_nearest_school_distance((row['latitude'], row['longitude'])), axis=1)
In [19]:
def get_nearest_park_distance(location):
    # Same lookup as above, but for the nearest park.
    try:
        result = gmaps.places_nearby(location=location, type='park', rank_by='distance')
        if len(result['results']) > 0:
            nearest_park_location = result['results'][0]['geometry']['location']
            distance = gmaps.distance_matrix(location, nearest_park_location, units='metric')['rows'][0]['elements'][0]['distance']['value']
            return distance
        else:
            return np.nan
    except Exception as e:
        print(e)
        return np.nan
filtered_data['nearest_park_distance'] = filtered_data.apply(lambda row: get_nearest_park_distance((row['latitude'], row['longitude'])), axis=1)
In [21]:
def extract_bedrooms(title):
    first_word = title.split()[0]
    return int(first_word) if first_word.isdigit() else 0  # Change 0 to 1 if you want to consider studio as 1 bedroom

filtered_data['number_of_bedrooms'] = filtered_data['title'].apply(extract_bedrooms)
filtered_data = filtered_data[filtered_data['number_of_bedrooms'] != 0]
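A quick sanity check of the extraction rule above, using a self-contained copy of the function and hypothetical Rightmove-style titles:

```python
# Self-contained copy of the bedroom-extraction rule used above: the leading
# word of a listing title is taken as the bedroom count, and anything else
# (e.g. "Studio") maps to 0, so those rows are later dropped.
def extract_bedrooms(title):
    first_word = title.split()[0]
    return int(first_word) if first_word.isdigit() else 0

# Hypothetical listing titles in the Rightmove style
titles = [
    '3 bedroom terraced house for sale',
    '1 bedroom flat for sale',
    'Studio apartment for sale',
]
print([extract_bedrooms(t) for t in titles])  # → [3, 1, 0]
```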

2.1.10 correlation analysis

In [33]:
columns_to_analyze = ['nearest_park_distance', 'nearest_school_distance', 'house_area', 'number_of_bedrooms', 'price']
filtered_data = filtered_data[columns_to_analyze]

# Calculate the correlation matrix
correlation_matrix = filtered_data.corr()

# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, linewidths=.5, square=True)
plt.title('Correlation Matrix')
plt.show()

From the correlation matrix, we can see that the price of the property is strongly correlated with the number of bedrooms and the area of the property, while the distances to the nearest school and park are not correlated with the price. This means that the number of bedrooms and the area of the property are the most important features for predicting the price.

In [34]:
sns.set()
cols = ['nearest_park_distance', 'nearest_school_distance', 'house_area', 'number_of_bedrooms', 'price']
sns.pairplot(filtered_data[cols])
plt.show()

2.1.11 regression model

To build the regression model, we need to select the features that will be used to predict the price. We will use the following features:

  • distance to the nearest school
  • distance to the nearest park
  • number of bedrooms
  • area of the property
In [38]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = filtered_data[['nearest_park_distance', 'nearest_school_distance', 'house_area', 'number_of_bedrooms']]
y = filtered_data['price']

# Split the data into training and testing sets (80% training and 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the model
model = DecisionTreeRegressor(random_state=42)

# Train the model using the training data
model.fit(X_train, y_train)

# Predict the target variable using the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error and R-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

# Example: predict the price of a property given its distances to the nearest
# park and school (in metres), its floor area, and its number of bedrooms.
# A DataFrame with matching column names avoids the feature-name warning.
new_property_data = pd.DataFrame(
    [[1500, 800, 80, 2]],  # replace with actual values
    columns=['nearest_park_distance', 'nearest_school_distance', 'house_area', 'number_of_bedrooms'],
)
predicted_price = model.predict(new_property_data)
print(f"Predicted Price: {predicted_price[0]:.2f}")
errors = y_test - y_pred



# Create and train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R2 Score:", r2)
models = {'Decision Tree Regressor': {'model': DecisionTreeRegressor(random_state=42), 'r2': 0},
          'Linear Regression': {'model': LinearRegression(), 'r2': 0}}

predictions = {}
for model_name, model_info in models.items():
    model = model_info['model']
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    predictions[model_name] = y_test - y_pred

# Plot the distribution of errors for both models in one image
plt.figure(figsize=(10, 8))
for model_name, errors in predictions.items():
    sns.kdeplot(errors, label=model_name)  # distplot is deprecated; kdeplot matches hist=False

plt.title('Distribution of Errors')
plt.legend()
plt.show()
predictions = {}
for model_name, model_info in models.items():
    model = model_info['model']
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    predictions[model_name] = y_pred

# Plot actual and predicted values for both models in one image
plt.figure(figsize=(10, 8))
for model_name, y_pred in predictions.items():
    plt.scatter(y_test, y_pred, alpha=0.5, label=model_name)

plt.title('Actual vs Predicted Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.show()

for model_name, model_info in models.items():
    model = model_info['model']
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    model_info['r2'] = r2
    print(f"{model_name} R2 Score: {r2:.2f}")

# Visualize the comparison
model_names = list(models.keys())
r2_scores = [model_info['r2'] for model_info in models.values()]

plt.bar(model_names, r2_scores)
plt.xlabel('Models')
plt.ylabel('R2 Score')
plt.title('Model Comparison')
plt.show()
Mean Squared Error: 1958220459808.74
R-squared Score: 0.95
Predicted Price: 1997900.00
Mean Squared Error: 41822010285505.94
R2 Score: 0.012372262554974034
Decision Tree Regressor R2 Score: 0.95
Linear Regression R2 Score: 0.01

To choose the best model, we built models with different algorithms and compared their performance. We used:

  • Decision Tree Regressor
  • Linear Regression

After building the models, we evaluated them with the R-squared score, a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variables. It ranges from 0 to 1, where 0 means the model explains none of the variability of the response data around its mean, and 1 means it explains all of it. The Decision Tree Regressor scores 0.95 on the test set, while Linear Regression scores only 0.01, so the tree model predicts property prices far more accurately. The error distributions tell the same story: the Decision Tree Regressor's errors are concentrated much more tightly around zero. Likewise, in the actual-versus-predicted plot, the tree model's points lie much closer to the diagonal. We therefore use the Decision Tree Regressor to predict the price of the property.
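Since the R-squared score drives the model choice, a quick illustration of its definition, 1 − SS_res / SS_tot, computed by hand and checked against scikit-learn. The prices here are toy numbers, not the notebook's data:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted prices (in pounds)
y_true = np.array([250_000, 480_000, 310_000, 720_000, 560_000], dtype=float)
y_pred = np.array([260_000, 450_000, 330_000, 700_000, 540_000], dtype=float)

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# The hand-rolled value matches sklearn's r2_score
print(round(r2_manual, 4), round(r2_score(y_true, y_pred), 4))
```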
3. Conclusion

In this project, our primary objective was to analyze the geographical distribution of properties in London. To gain a deeper understanding of the real estate market, we examined the distribution of properties across several dimensions: property type, price, and price per square meter.

First, we categorized the properties by type, such as detached, semi-detached, and terraced houses, flats, and apartments. This categorization allowed us to identify patterns and trends in the spatial distribution of different types of properties, revealing areas of London that are predominantly residential, commercial, or mixed-use. Next, we investigated the distribution of property prices across London to identify areas with varying price ranges. By mapping property prices, we were able to discern high-priced neighborhoods, more affordable areas, and transitional zones with a diverse mix of property values. Lastly, we examined price per square meter, which offers a more nuanced view of property value by taking the size of the property into account. By clustering the properties on this measure, we identified areas with a diverse mix of property values as well as areas where properties command similar prices per square meter. Together, these three dimensions give a comprehensive picture of the city's real estate landscape, knowledge that can help potential buyers, investors, and urban planners make informed decisions in the London property market.

We also used word clouds to explore the features of properties in different price ranges. Word clouds are visual representations of text data in which the size of each word indicates its frequency or significance; by examining listing descriptions, we identified the key attributes that differentiate properties in various price brackets. Luxury properties typically emphasize entertainment amenities, such as swimming pools, home theaters, gyms, and outdoor leisure areas, catering to affluent buyers and an opulent living environment. More affordable properties, by contrast, prioritize functionality and basic needs, highlighting proximity to public transportation, schools, and grocery stores as well as amenities such as heating and air conditioning; they appeal to a broader market of individuals and families seeking comfortable, functional living spaces. Comparing attributes across price ranges in this way gives real estate professionals, buyers, and investors valuable insight into the priorities and preferences of different market segments.

Finally, we built a machine learning model to predict property values. The Decision Tree Regressor achieved an R-squared of 0.95, compared with 0.01 for Linear Regression, so the tree model predicts property prices far more accurately and is the one we use to predict the price of a property.

4. Future work

It is important to consider the limitations of this study when interpreting the results and making decisions based on them:

  • Geographic scope: Due to the limited scope of the data, the findings may not be generalizable to other regions or cities. Given the unique characteristics of the London real estate market, conclusions drawn from this analysis may not apply elsewhere without further research.
  • Source of data: The data is limited to properties listed on a single website, which may not represent the entire London property market. The types of properties listed or the areas the site covers may be biased, which could affect the accuracy of the results; drawing on multiple data sources would give a more comprehensive view of the market.
  • Limited property attributes: Only a limited set of attributes is available on the website, which may exclude important factors affecting property prices or preferences. For example, the analysis does not account for crime rates, air quality, or proximity to green spaces, all of which can significantly influence property values.
  • Absence of historical data: Without historical data, the analysis is limited to a snapshot of the market at a single point in time, making it difficult to identify trends or changes over time. Historical data would show how different areas have evolved, how prices have changed, and which factors have consistently influenced the market.

Addressing these limitations would require collecting more diverse and comprehensive data on the London property market, including historical data and additional property attributes. The resulting analysis would provide a more accurate and holistic understanding of the market, allowing better-informed decision-making for buyers, investors, and urban planners.
